Learning Check
When you complete this session, you will be able to:
- understand the assumptions in SLR;
- understand how to critically assessing a regression model;
- perform regression diagnostics checking for Simple Linear
Regression;
- understand outliers and leverage;
- perform some simple transformations.
Before you start, load the stored datasets provided for exercises
below by running the following chunk (assuming the file
“InsulatingData.Rdata” is saved in the same directory as the .Rmd
worksheet).
Question 1 The timing of production runs (Sheather, 2009)
The original data are in the form of the time taken (in minutes) for
a production run, \(Y\), and the number
of items produced, \(X\), for 20
randomly selected orders as supervised by three managers. At this stage
we shall only consider the data for one of the managers.
(a). Open the Excel file production.txt. Construct a
scatterplot of production run, \(Y\),
and the number of items produced, \(X\). Fit the least squares line to the data
using R. Explain the fitted model, by interpreting the slope and
intercept.
production=read.table("production.txt", header=T)
production
prod.lm=lm(RunTime ~ RunSize, data=production) ## the LS line
prod.lm1=lm(RunTime ~ 10*RunSize, data=production)
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
The equation of the least squares line of best fit is
\[y = 149.7 + 0.26x.\]
The intercept is 149.7, which is where the line of best fit crosses
the run time axis. The intercept in the model has the following
interpretation: for any production run, the average set up time is 149.7
minutes.
The slope of the line is 0.26. Thus, we say that each additional unit
to be produced is predicted to add 0.26 minutes to the run time.
The value of \(R^2= 0.7302\) which
implies that 73.02% of the information (variation) in Run Time is
explained by the least squares regression line, suggesting that the
model is a good model.
(b). Construct residual diagnostic plots, and assess whether you
think that the assumptions for linear regression have been satisfied.
You can use the function rstudent (or
rstandard) to first standardize the residuals.
First, we manually check the normality through histograms and
QQ-plots of residuals and standardised residuals.
names(prod.lm) ## checking the output of the fitted model, residuals
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign" "qr"
[8] "df.residual" "xlevels" "call" "terms" "model"
res=prod.lm$residuals
std.res=rstandard(prod.lm) ## standardised residuals
par(mfrow=c(2,2)) ## plotting 4 plots to check normality
hist(res)
hist(std.res)
qqnorm(std.res)
qqline(std.res)

From the above histograms and QQ-plots, the normality is
satisfied.
par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(production$RunSize,res, xlab="Run Size", ylab="Residuals") # Residuals vs x
plot(production$RunSize,std.res, xlab="Run Size", ylab="Standardised Residuals")

- The next residual analysis is to check about randomness (independence) and constant variance. We plot both
residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
the residual are random scattered with no pattern whatsoever.
- The bottom left plot shows the residuals against Run Size to check randomness (no pattern) and constant
variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no
pattern) and constant variance. Both plots confirm randomness and constant variance.
(c). Using the fitted model.
par(mfrow=c(2,2)) # to view 2x2 plots
plot(prod.lm)


The figure above provides diagnostic plots produced by R when the
command plot(prod.lm) is used, where prod.lm
is the result of the “lm” command.
+ The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity.
+ The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is almost flat (i.e constant) to confirm randomness and constant variance.
+ The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line, confirming Normality.
+ The bottom right plot of standardized residuals against leverage enables one
to readily identify any ‘bad’ leverage points.
- Recall that the rule for simple linear regression for classifying a point as a leverage
point is if leverage $h_{ii} > 4/ n.$ As $n=20$, the rule is $h_{ii} > 0.2.$ As leverage $h_{ii}$ in the above
bottom right is less than 0.2, then there is no leverage points.
- Recall that we classify points as outliers if their standardized residuals have absolute value greater than
2. All the residuals are within (-2,2), therefore no outliers being detected.
Question 2 The invoices data (Sheather, 2009)
The manager of the purchasing department of a large company would
like to develop a regression model to predict the average amount of time
it takes to process a given number of invoices. Over a 30-day period,
data are collected on the number of invoices processed and the total
time taken (in hours). The data are available in the file
invoices.txt.
Complete the following tasks.
(a). Fit the least squares line to the data using R. Explain the
fitted model, by interpreting the slope and intercept.
invoices=read.table("invoices.txt", header=T)
View(invoices)
inv.lm=lm(Time ~ Invoices, data=invoices)
plot(Time ~ Invoices, data=invoices)
abline(inv.lm)

Call:
lm(formula = Time ~ Invoices, data = invoices)
Residuals:
Min 1Q Median 3Q Max
-0.59516 -0.27851 0.03485 0.19346 0.53083
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6417099 0.1222707 5.248 1.41e-05 ***
Invoices 0.0112916 0.0008184 13.797 5.17e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3298 on 28 degrees of freedom
Multiple R-squared: 0.8718, Adjusted R-squared: 0.8672
F-statistic: 190.4 on 1 and 28 DF, p-value: 5.175e-14
The equation of the least squares line of best fit is
\[ Time= 0.642 + 0.011 Invoices
.\]
The intercept is 0.642, which is not interpretable for no
invoices.
The slope of the line is 0.011. Thus, we say for that for each
additional invoice, the average amount of time it takes to process
increases by 0.011 hour.
(b). Construct residual diagnostic plots, and assess whether you
think that the assumptions for linear regression have been satisfied.
You can use the function rstudent (or
rstandard) to first standardize the residuals.
First, we check the normality through histograms and QQ-plots of
residuals and standardised residuals.
names(inv.lm) ## checking the output of the fitted model, residuals
[1] "coefficients" "residuals" "effects" "rank" "fitted.values"
[6] "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
res=inv.lm$residuals
std.res=rstandard(inv.lm) ## standardised residuals
par(mfrow=c(2,2)) ## plotting 4 plots to checkk normality
hist(res)
hist(std.res)
qqnorm(std.res)
qqline(std.res)

The histograms show bimodalities. From the QQ-plots, the normality
seems to be satisfied, except 1 point at the top and 1 point at the
bottom respectively.
par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals")
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(res, invoices$Invoices, xlab="Invoices", ylab="Residuals")
plot(std.res, invoices$Invoices, xlab="Invoices", ylab="Standardised Residuals")

The next residual analysis is to check about randomness
(independence) and constant variance. We plot both residuals and
standardised residuals against time (top left and top right). From these
plots we can see a bit of increasing pattern, small residuals within
0-15 and then larger residuals within 15-30.
The bottom left plot shows the residuals against Invoices to check
randomness (no pattern) and constant variance. The bottom right plot
provides the standardised residuals against fitted values to check
randomness (no pattern) and constant variance. Both plots show a bit of
pattern between negative and positive Invoices.
(c). Perform diagnostic checking for the fitted model in (a) using
“plot(file.lm).” Interpret the outputs.
par(mfrow=c(2,2))
plot(inv.lm)

The top left plot shows the residuals against fitted values to
check randomness (no pattern) and linearity.
The bottom left provides the standardised residuals against
fitted values to check randomness (no pattern) and constant variance.
The smoothing curve using standardised residuals is a curve which means
that independence and constant variance may not be satisfied.
The top right plot is a normal Q–Q plot. Most of the observations
lie around the straight line except the two points, confirming
Normality.
The bottom right plot of standardized residuals against leverage
enables one to readily identify any ‘bad’ leverage points.
- Recall that the rule for simple linear regression for classifying a point as a leverage
point is if leverage $h_{ii} > 4/ n.$ As $n=30$, the rule is $h_{ii} > 0.13.$ As leverage $h_{ii}$ in the above
bottom right plot has one point with $h_{ii} > 0.13$ , then there is one leverage point.
- Recall that we classify points as outliers if their standardized residuals have
absolute value greater than 2. All the residuals are within (-2,2), therefore no outliers being detected. The
above leverage point is a `good` leverage point.
(d). Use the function influence.measures to explore
measures of leverage and Cook’s distance.
# check help function first
Inv.infl<-influence.measures(inv.lm)
Inv.infl
Influence measures of
lm(formula = Time ~ Invoices, data = invoices) :
Points increase in influence to the extent that they lie on their
own, a long way from the mean of \(x\)
and the bulk of the data. The column “hat” in the above is the commonest
measure of this leverage and is defined by:
\[h_i = \frac{1}{n} +
\frac{(x_i-\bar{x})^2}{\sum(x_j -\bar{x})^2} .\]
Cook’s distance (the column “cook.d” in the above) is one measure of
influence: for each point, it combines the size of its residual along
with a measure of the leverage. An approximate cutoff is \(4/(n-2)\), but in practice it is important
to look for gaps in values of Cook’s distance instead of just whether or
not the values exceed the cutoff.
There are 2 functions which can also be used to calculate these
measures directly: hatvalues and
cooks.distance.
Leverage.inv<-hatvalues(inv.lm)
Cooks.Dist.inv<-cooks.distance(inv.lm)
Inv <- invoices$Invoices
# plotting the influence measures against their x value - Invoices
n=30
plot(Leverage.inv~Inv)
abline(h=4/n, col=3)
plot(Cooks.Dist.inv~Inv)
abline(h=4/(n-2), col=2)
Recall that we classify points as outliers if their standardized
residuals have absolute value greater than 2. All the residuals are
within (-2,2), therefore no outliers being detected. The above leverage
point is a good leverage point.
Cooks Distance identifies one potential outlier, observation
number 30. However, it is important to look for gaps in values of Cook’s
distance instead of just whether or not the values exceed the
cutoff.
# plotting the influence measures against their x value - Invoices
n=30
plot(Leverage.inv~Inv)
abline(h=4/n, col=3)

plot(Cooks.Dist.inv~Inv)
abline(h=4/(n-2), col=2)

28 26 1 11 24 25 16 15
0.03348526 0.03395333 0.03554890 0.03580511 0.03783426 0.04002682 0.04268500 0.04470347
12 29 9 7 3 17 22 21
0.04695762 0.04875109 0.04933167 0.05065542 0.05402803 0.05548070 0.05552997 0.05775210
2 13 5 27 6 20 8 10
0.06354063 0.06354063 0.06435114 0.06435114 0.06529058 0.07786621 0.08542440 0.09496328
30 18 14 4 23 19
0.09620163 0.09863069 0.10127820 0.10389038 0.10917143 0.18897091
Question 3 (From Weisberg, S. (2005). Applied Linear
Regression, 3rd edition. New York: Wiley)
In this question, you will be learning the R code in more detail to
provide better visualisation which may be needed for your project.
Furthermore, the ANOVA is explained in more detail following the lecture
(part c).
The R data file Wind.RData contains a data
frame (wm1) containing wind speed data (in m/s) at two
sites: a reference site (Rspd) and a candidate site
(CSpd). Data were collected every 6 hours for the year 2002
except in the month of May. The objective is to be able to determine
whether wind speed at the reference site, which has permanent wind speed
monitoring equipment, can be used to predict the wind speed at this
candidate site in the future in order to determine whether to place a
small wind farm there.
10 16 20 8 9 14
0.0002035386 0.0005353096 0.0010595498 0.0011024438 0.0032903108 0.0033298164
17 28 29 4 1 18
0.0036431399 0.0045459002 0.0059479355 0.0061181410 0.0088295724 0.0090501347
7 5 24 25 11 6
0.0092145747 0.0150936105 0.0174505677 0.0201627039 0.0260821401 0.0302309933
21 12 22 26 19 27
0.0308161815 0.0341277200 0.0387338505 0.0423707984 0.0502287555 0.0510541088
3 2 15 13 23 30
0.0599026174 0.0770062759 0.0797769665 0.0898041145 0.1207897984 0.1525784387
- Draw a scatterplot of the response
CSpd against the
predictor RSpd and label it appropriately.
par(pty = "s") # A graphical parameter that sets the PlotTYpe to "square"
# The syntax of the plot statement says "plot CSpd against RSpd but extract them
# from the data frame wm1. The xlim and ylim arguments are the same - the range
# of all the data.
plot(CSpd ~ RSpd, data = wm1,
xlab = "Wind speed at reference site (m/s)",
ylab = "Wind speed at candidate site (m/s)",
main = "Wind speeds at reference and candidate sites",
xlim = range(c(wm1$CSpd, wm1$RSpd)), ylim = c(range(c(wm1$CSpd, wm1$RSpd))))

[1] 1116
Fit a simple linear model to these data, and present the appropriate
regression summaries. Plot the fitted line onto the plot in (a).
# Load the workshop data here
load("Wind.RData")
nrow('Wind.RData')
NULL
To plot the fitted line onto an existing plot in R or RStudio,
you only need to invoke the abline command (from the
console) that we’ve used already and that’s shown below. Here, however,
because I want to produce a separate plot from the one above, I have to
re-plot the data using the above commands and then invoke
abline.
# Notice the similarity with the plot statement above
Wind.lm <- lm(CSpd ~ RSpd, data = wm1)
# Here are the contents of the object Wind.lm:
names(Wind.lm)
[1] "coefficients" "residuals" "effects" "rank" "fitted.values"
[6] "assign" "qr" "df.residual" "xlevels" "call"
[11] "terms" "model"
# and here's some basic output:
Wind.lm
Call:
lm(formula = CSpd ~ RSpd, data = wm1)
Coefficients:
(Intercept) RSpd
3.1412 0.7557
# A bit more comes from the summary() function:
summary(Wind.lm)
Call:
lm(formula = CSpd ~ RSpd, data = wm1)
Residuals:
Min 1Q Median 3Q Max
-7.7877 -1.5864 -0.1994 1.4403 9.1738
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.14123 0.16958 18.52 <2e-16 ***
RSpd 0.75573 0.01963 38.50 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.466 on 1114 degrees of freedom
Multiple R-squared: 0.5709, Adjusted R-squared: 0.5705
F-statistic: 1482 on 1 and 1114 DF, p-value: < 2.2e-16
- Construct the four diagnostic plots that we discussed in the
lecture. Do you think the simple linear model is a good fit? What
improvements might you suggest?
par(pty = "s")
plot(CSpd ~ RSpd, data = wm1,
xlab = "Wind speed at reference site (m/s)",
ylab = "Wind speed at candidate site (m/s)",
main = "Wind speeds at reference and candidate sites",
xlim = range(c(CSpd, RSpd)), ylim = c(range(c(CSpd, RSpd))))
# abline() takes two arguments: a slope and an intercept. We can
# extract these from the object Wind.lm (which is a list) by
# using the $ operator, e.g., Wind.lm$coefficients (try it!) and
# using that as an argument to abline():
abline(Wind.lm$coef) # or simply abline(Wind.lm)

The interpretation:
- Plot 1 - Residuals vs Fitted: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.
- Plot 2 - Scale - Location: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.
- Plot 3 - QQ plot of standardised residuals: while most of the points are on the straight line, some points on the upper part are off the lines. Normality might not be satisfied.
- Plot 4 - Residuals vs Leverage: As $n=1116$ then the cut-off for Cook's distance is $D_i > 4/(n-2)=0.0036.$ We can see that many of the standardised residuals with the leverage greater than 0.04.
Question 4: House selling price data
A set of data on the selling price \(Y\) (in $10,000) and annual taxes \(X\) (in $1,000) for 24 houses is available
available online, {housesellingprice.txt}.
- Assuming that a simple linear regression model is appropriate,
obtain the least squares fit relating selling price to taxes paid. What
is the estimate of \(\sigma^2\)?
Read data in
par(mfrow=c(2,2)) # Places 4 graphs together on the same plotting region
plot(Wind.lm)

Simple Linear regression fit
house = read.table("housesellingprice.txt",header=T)
summary(house)
sales taxes
Min. :25.90 Min. :3.891
1st Qu.:29.90 1st Qu.:5.057
Median :33.70 Median :5.974
Mean :34.61 Mean :6.405
3rd Qu.:38.15 3rd Qu.:7.873
Max. :45.80 Max. :9.142
\(\widehat\beta_0=13.3202\), \(\widehat\beta_1=3.3244\)
house.lm<-lm(sales~taxes,data=house)
summary(house.lm)
Call:
lm(formula = sales ~ taxes, data = house)
Residuals:
Min 1Q Median 3Q Max
-3.8343 -2.3157 -0.3669 1.9787 6.3168
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.3202 2.5717 5.179 3.42e-05 ***
taxes 3.3244 0.3903 8.518 2.05e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.961 on 22 degrees of freedom
Multiple R-squared: 0.7673, Adjusted R-squared: 0.7568
F-statistic: 72.56 on 1 and 22 DF, p-value: 2.051e-08
\(\widehat\sigma^2={Residual
MS}=8.77\)
- Find the mean selling price given that the taxes paid are \(7.50\).
Analysis of Variance Table
Response: sales
Df Sum Sq Mean Sq F value Pr(>F)
taxes 1 636.16 636.16 72.556 2.051e-08 ***
Residuals 22 192.89 8.77
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
the mean selling price \(=38.25296\)
that the taxes paid are \(X_0=7.50\).
- Construct a graph of \(Y\) versus
\(X\). Then add the fitted line, the
95% confidence interval of the mean response, and the 95% prediction
interval of the new actual values on the graph.
predict(house.lm,newdata=data.frame(taxes=7.5),interval="confidence",level=0.95)
fit lwr upr
1 38.25296 36.71776 39.78816
- Perform diagnostic checking for the fitted model in (a) using
“plot(file.lm).” Interpret the outputs.
with(house,plot(sales~taxes,type="p"))
with(house,lines(fitted(house.lm)~taxes))
new = data.frame(taxes=seq(3,10,.1))
CIs = predict(house.lm,new,interval="confidence")
PIs = predict(house.lm,new,interval="predict")
matpoints(new$taxes,CIs,lty=c(1,2,2),col=c("black","red","red"),type="l")
matpoints(new$taxes,PIs,lty=c(1,2,2),col=c("black","blue","blue"),type="l")

The top left plot shows the residuals against fitted values to
check randomness (no pattern) and linearity. The smoothing curve shows a
reasonable linear relationship and no pattern.
The bottom left provides the standardised residuals against
fitted values to check randomness (no pattern) and constant variance.
The smoothing curve using standardised residuals is a curve which means
that independence and constant variance may not be satisfied.
The top right plot is a normal Q–Q plot. Most of the observations
lie around the straight line except the two points, confirming
Normality.
The bottom right plot of standardized residuals against leverage
enables one to readily identify any ‘bad’ leverage points.
par(mfrow=c(2,2))
plot(house.lm)

# check help function first
house.lev<-hatvalues(house.lm)
house.lev
1 2 3 4 5 6 7 8
0.08009597 0.07494802 0.10189804 0.10097003 0.07310359 0.15145530 0.04613071 0.05281317
9 10 11 12 13 14 15 16
0.04744471 0.06286388 0.04197728 0.04511789 0.07355860 0.10057696 0.04314772 0.07471120
17 18 19 20 21 22 23 24
0.16214717 0.04466605 0.06413614 0.14091382 0.04346584 0.10811699 0.09396602 0.17177488
- Recall that the rule for simple linear regression for classifying a point as a leverage
point is if leverage $h_{ii} > 4/ n.$ As $n=24$, the rule is $h_{ii} > 0.167.$ As leverage $h_{ii}$ in the
above output shows that one point (observation no 24) with $h_{ii}=0.17177 > 0.167$ , then there is one
leverage point.
- Recall that we classify points as outliers if their standardized residuals have
absolute value greater than 2. There is one observation (no 15) with the standard residual is greater than
2. However observation number 24, the leverage point, with standard residual less than 2 is a good
leverage point.
Question 5 Bid data (Sheather, 2009)
A building maintenance company is planning to submit a bid on a
contract to clean corporate offices scattered throughout an office
complex. The costs incurred by the maintenance company are proportional
to the number of cleaning crews needed for this task. Recent data are
available for the number of rooms that were cleaned by varying numbers
of crews. For a sample of 53 days, records were kept of the number of
crews used and the number of rooms that were cleaned by those crews. The
data can be found in the file cleaning.txt.
The aim is to develop a regression equation to model the relationship
between the number of rooms cleaned and the number of crews.
Complete the following tasks.
(a). Fit the least squares line to the data using R. Explain the
fitted model, by interpreting the slope and intercept.
std.res.h=rstandard(house.lm)
#std.res.h
sort(std.res.h)
14 1 23 11 8 4 22
-1.36539093 -1.32680727 -1.23490307 -1.09307472 -1.05824009 -0.91550077 -0.79182376
12 7 3 2 19 5 13
-0.77096776 -0.70102563 -0.18618750 -0.17949347 -0.17219414 -0.08436379 -0.03798314
10 17 16 20 24 9 6
0.19524305 0.19658735 0.45732239 0.70746820 0.77548911 1.10892170 1.33622425
18 21 15
1.47498387 1.50462011 2.18088723
bid=read.table("cleaning.txt", header=T)
head(bid)
The equation of the least squares line of best fit is
\[Rooms = 1.78 + 3.70 Crews.\]
The intercept is \(1.78 \approx 2\)
which is where the line of best fit crosses the Rooms axis. The
intercept in the model has the following interpretation: for Crews=0,
the average of number of cleaned room is 2.
The slope of the line is \(3.70 \approx
4\). Thus, we say that for each additional crew, it is predicted
to add 4 rooms are being cleaned.
(b). Construct residual diagnostic plots, and assess whether you
think that the assumptions for linear regression have been satisfied.
You can use the function rstudent (or
rstandard) to first standardize the residuals.
First, we manually check the normality through histograms and
QQ-plots of residuals and standardised residuals.
bid.lm=lm(Rooms ~ Crews, data=bid) ## the LS line
plot(Rooms ~ Crews, data=bid, xlab="Number of Crews", ylab="Number of Rooms")
abline(bid.lm)

Call:
lm(formula = Rooms ~ Crews, data = bid)
Residuals:
Min 1Q Median 3Q Max
-15.9990 -4.9901 0.8046 4.0010 17.0010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7847 2.0965 0.851 0.399
Crews 3.7009 0.2118 17.472 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.336 on 51 degrees of freedom
Multiple R-squared: 0.8569, Adjusted R-squared: 0.854
F-statistic: 305.3 on 1 and 51 DF, p-value: < 2.2e-16
From the above histograms and QQ-plots, the normality is
satisfied.
res.bid=bid.lm$residuals
std.res.bid=rstandard(bid.lm) ## standardised residuals
par(mfrow=c(2,2)) ## plotting 4 plots to check normality
hist(res.bid)
hist(std.res.bid)
qqnorm(res.bid)
qqline(res.bid)
qqnorm(std.res.bid)
qqline(std.res.bid)

- The next residual analysis is to check about randomness (independence) and constant variance. We plot both
residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
the residual are random scattered with no pattern whatsoever. However, there seems to show an increasing trend for variances.
- The bottom left plot shows the residuals against Number of Crews to check randomness (no pattern) and constant variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance.
- It is clear that the variability in the standardized residuals tends to increase with the number of crews. Thus, the assumption that the variance of the errors is constant appears to be violated in this case.
(c). Perform diagnostic checking for the fitted model in (a) using
“plot(file.lm).” Interpret the outputs.
par(mfrow=c(2,2))
plot(res.bid, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res.bid,xlab="Time", ylab="Standardised Residuals")
plot(bid$Crews,res.bid, xlab="Number of Crews", ylab="Residuals") # Residuals vs x
plot(bid$Crews,std.res.bid, xlab="Number of Crews", ylab="Standardised Residuals")

- The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. The smoothing curve shows a reasonable linear relationship and no pattern.
- The bottom left provides the squared root of standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is an increasing trend which means that constant variance may not be satisfied.
- The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.
- The bottom right plot of standardized residuals against leverage enables one to readily identify any ‘bad’ leverage points (ie large leverage and large standardised residuals)
(d). As the data on each axis are in the form of counts and are
measured in the same units, the square root transformation of both the
predictor variable and the response variable should be implemented.
Apply these transformations and repeat (a).
par(mfrow=c(2,2))
plot(bid.lm)

(e). Perform diagnostic checking for the fitted model in (d) using
“plot(file.lm).” Interpret the outputs.
bid.sq.lm=lm(sqrt(Rooms) ~ sqrt(Crews), data=bid) ## the LS line
plot(sqrt(Rooms) ~ sqrt(Crews), data=bid, xlab="Square Root of Number of Crews", ylab="Square Root of Number of Rooms")
abline(bid.sq.lm)

Call:
lm(formula = sqrt(Rooms) ~ sqrt(Crews), data = bid)
Residuals:
Min 1Q Median 3Q Max
-1.09825 -0.43988 0.06826 0.42726 1.20275
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2001 0.2757 0.726 0.471
sqrt(Crews) 1.9016 0.0936 20.316 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.594 on 51 degrees of freedom
Multiple R-squared: 0.89, Adjusted R-squared: 0.8879
F-statistic: 412.7 on 1 and 51 DF, p-value: < 2.2e-16
- After the transformation, the bottom left-hand plot further demonstrates the
benefit of the square root transformation in terms of stabilizing the error term.
- Thus, taking the square root of both the x and the y variables has stabilized
the variance of the random errors and hence produced a valid model.
(f). Predict the number of rooms that can be cleaned by 4 crews and
by 16 crews and its 95% confidence intervals using the models in (a) and
(d) respectively.
Using (a)
par(mfrow=c(2,2))
plot(bid.sq.lm)

Using (d)
conf.pred0=predict(bid.lm,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
conf.pred0
fit lwr upr
1 16.58827 1.58941 31.58713
2 60.99899 45.81025 76.18773
- We can see that the prediction interval based on the transformed data is narrower
than that based on the untransformed data when the number of crews is 4 (7.78,
27.21) vs (1.58, 31.59) and wider when the number of crews is 16 [(43.33, 81.55) is
wider than (45.81, 76.19)].
- This is expected as the original scale the data have variance which increases as
the x -variable increases which means that realistic prediction intervals will
get wider as the x -variable increases.
- In summary, ignoring nonconstant variance in the raw data from this example led to
invalid prediction intervals.
---
title: "STAT2401 Analysis of Experiments"
author: '**Practical Week 6: Solutions**'
#date: "Practical 2"
output:
  html_document:
    highlight: haddock
    # number_sections: yes
    theme: flatly
    toc: yes
  html_notebook:
    highlight: haddock
    # number_sections: yes
    theme: flatly
  pdf_document:
    toc: yes
  word_document:
    highlight: tango
    toc: yes
---


```{r echo=FALSE}
knitr::opts_chunk$set(prompt=FALSE, comment=NA, tidy=TRUE, error=TRUE, warning=FALSE, message=FALSE)
```


## Learning Check

When you complete this session, you will be able to:

1. understand the assumptions in SLR;
2. understand how to critically assessing a regression model;
3. perform regression diagnostics checking for Simple Linear Regression;
4. understand outliers and leverage;
5. perform some simple transformations.


Before you start, load the stored datasets provided for exercises below by running the following chunk (assuming the file  "InsulatingData.Rdata" is saved in the same directory as the .Rmd  worksheet).

## Question 1 The timing of production runs (Sheather, 2009)

The original data are in the form of the time taken (in minutes) for a production run, $Y$, and the number of items produced,
$X$, for 20 randomly selected orders as supervised by three managers. At this stage we shall only consider the data for one of the managers.

(a). Open the Excel file *production.txt*. Construct a scatterplot of production run, $Y$, and the number of items produced, $X$. Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

```{r}
production=read.table("production.txt", header=T)
production
```

```{r}
prod.lm=lm(RunTime ~ RunSize, data=production) ## the LS line
prod.lm1=lm(RunTime ~ RunSize, data=production)
plot(RunTime ~ RunSize, data=production, xlab="Run Size", ylab="Run Time")
abline(prod.lm)
summary(prod.lm)
```

The equation of the least squares line of best fit is

$$y = 149.7 + 0.26x.$$

The intercept is 149.7, which is where the line of best fit crosses the run time axis. The
intercept in the model has the following interpretation: for any production run, the
average set up time is 149.7 minutes.

The slope of the line is 0.26. Thus, we say that each
additional unit to be produced is predicted to add 0.26 minutes to the run time. 

The value of $R^2=  0.7302$ which implies that 73.02% of the information (variation) in Run Time is explained by the least squares regression line, suggesting that the model is a good model.

(b).  Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function `rstudent` (or `rstandard`) to first standardize the residuals.


First, we manually check the *normality through histograms and QQ-plots of residuals and standardised residuals.*

```{r}
names(prod.lm)  ## checking the output of the fitted model, residuals
res=prod.lm$residuals
std.res=rstandard(prod.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to check normality
hist(res)
hist(std.res)
qqnorm(res)
qqline(res)
qqnorm(std.res)
qqline(std.res)
par(mfrow=c(1,1))
```

From the above histograms and QQ-plots, the normality is satisfied.

```{r}
par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(production$RunSize,res, xlab="Run Size", ylab="Residuals") # Residuals vs x
plot(production$RunSize,std.res,  xlab="Run Size", ylab="Standardised Residuals")
```

     - The next residual analysis is to check about randomness (independence) and constant variance. We plot both
     residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
     the residual are random scattered with no pattern whatsoever.

     - The bottom left plot shows the residuals against Run Size to check randomness (no pattern) and constant
     variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no
     pattern) and constant variance. Both plots confirm randomness and constant variance.

(c). Using the fitted model.

```{r}
par(mfrow=c(2,2)) # to view 2x2 plots
plot(prod.lm)
plot(prod.lm1)
```

The figure above provides diagnostic plots produced by R when the command `plot(prod.lm)` is used, where `prod.lm` is the result of the “lm” command. 

    + The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. 

    + The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is almost flat (i.e constant) to confirm randomness and constant variance.

    + The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line, confirming Normality.

    + The bottom right plot of standardized residuals against leverage enables one
    to readily identify any ‘bad’ leverage points. 

       - Recall that the rule for simple linear regression for classifying a point as a leverage
       point is if leverage $h_{ii} > 4/ n.$ As $n=20$, the rule is $h_{ii} > 0.2.$ As leverage $h_{ii}$ in the above
       bottom right is less than 0.2, then there is no leverage points.

       - Recall that we classify points as outliers if their standardized residuals have absolute value greater than
       2. All the residuals are within (-2,2), therefore no outliers being detected.


## Question 2 The invoices data (Sheather, 2009)

The manager of the purchasing department of a large company would like to
develop a regression model to predict the average amount of time it takes to process a given number of invoices. Over a 30-day period, data are collected on the number of invoices processed and the total time taken (in hours). The data are available in the file *invoices.txt.* 

Complete the following tasks.

(a). Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

```{r}
invoices=read.table("invoices.txt", header=T)
View(invoices)
```

```{r}
inv.lm=lm(Time ~ Invoices, data=invoices)
plot(Time ~ Invoices, data=invoices)
abline(inv.lm)
summary(inv.lm)
```

The equation of the least squares line of best fit is

$$ Time= 0.642 + 0.011 Invoices .$$

The intercept is 0.642, which is not interpretable for no invoices.

The slope of the line is 0.011. Thus, we say for that for each
additional invoice, the average amount of time it takes to process increases by 0.011 hour.

(b). Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function `rstudent` (or `rstandard`) to first standardize the residuals.

First, we check the normality through histograms and QQ-plots of residuals and standardised residuals.

```{r}
names(inv.lm)  ## checking the output of the fitted model, residuals
res=inv.lm$residuals
std.res=rstandard(inv.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to checkk normality
hist(res)
hist(std.res)
qqnorm(res)
qqline(res)
qqnorm(std.res)
qqline(std.res)
par(mfrow=c(1,1))
```

The histograms show bimodalities. From the QQ-plots, the normality seems to be satisfied, except 1 point at the top and 1 point at the bottom respectively.

```{r}
par(mfrow=c(2,2))
plot(res, xlab="Time", ylab="Residuals")
plot(std.res,xlab="Time", ylab="Standardised Residuals")
plot(res, invoices$Invoices, xlab="Invoices", ylab="Residuals")
plot(std.res, invoices$Invoices,  xlab="Invoices", ylab="Standardised Residuals")
```

The next residual analysis is to check about randomness (independence) and constant variance. We plot both residuals and standardised residuals against time (top left and top right). From these plots we can see a bit of increasing pattern, small residuals within 0-15 and then larger residuals within 15-30.

The bottom left plot shows the residuals against Invoices to check randomness (no pattern) and constant variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. Both plots show a bit of pattern between negative and positive Invoices.

(c). Perform diagnostic checking for the fitted model in (a) using "plot(file.lm)." Interpret the outputs.


```{r}
par(mfrow=c(2,2))
plot(inv.lm)
```


+ The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. 

+ The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is a curve which means that independence and constant variance may not be satisfied.

+ The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.

+ The bottom right plot of standardized residuals against leverage enables one
to readily identify any ‘bad’ leverage points. 

       - Recall that the rule for simple linear regression for classifying a point as a leverage
       point is if leverage $h_{ii} > 4/ n.$ As $n=30$, the rule is $h_{ii} > 0.13.$ As leverage $h_{ii}$ in the above
       bottom right plot has one point with $h_{ii} > 0.13$ , then there is one leverage point.

      - Recall that we classify points as outliers if their standardized residuals have
      absolute value greater than 2. All the residuals are within (-2,2), therefore no outliers being detected. The
      above leverage point is a `good` leverage point.

(d). Use the function `influence.measures` to explore measures of leverage and Cook's distance.   

```{r}
# check help function first
Inv.infl<-influence.measures(inv.lm)
Inv.infl
#head(Inv.infl$infmat)
```

Points increase in influence to the extent that they lie on their own, a long way from the mean of $x$ and the bulk of the data. The column "hat" in the above is the commonest measure of this leverage and is defined by:

$$h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum(x_j -\bar{x})^2} .$$


Cook’s distance (the column "cook.d" in the above) is one measure of influence: for each point, it combines the size of its residual along with a measure of the leverage. An approximate cutoff is $4/(n-2)$, but in practice it is important to look for gaps in values of Cook’s distance instead of just whether or not the values exceed the cutoff.

There are 2 functions which can also be used to calculate these measures directly: `hatvalues` and `cooks.distance`.

```{r}
Leverage.inv<-hatvalues(inv.lm)
Cooks.Dist.inv<-cooks.distance(inv.lm)
Inv <- invoices$Invoices
```


```{r, fig3, fig.width = 5, fig.asp = .5}
# plotting the influence measures against their x value - Invoices 
n=30
plot(Leverage.inv~Inv)
abline(h=4/n, col=3)
plot(Cooks.Dist.inv~Inv)
abline(h=4/(n-2), col=2)
```

- Recall that we classify points as outliers if their standardized residuals have
absolute value greater than 2. All the residuals are within (-2,2), therefore no outliers being detected. The above leverage point is a `good` leverage point.

- Cooks Distance identifies one potential outlier, observation number 30. However, it is important to look for gaps in values of Cook’s distance instead of just whether or not the values exceed the cutoff.



```{r}
sort(Leverage.inv)
```

```{r}
sort(Cooks.Dist.inv)
```


## Question 3 (From Weisberg, S. (2005). *Applied Linear Regression*, 3rd edition. New York: Wiley)

In this question, you will be learning the R code in more detail to provide better visualisation which may be needed for your project. Furthermore, the ANOVA is explained in more detail following the lecture (part c).

The *R* data file `Wind.RData` contains a data frame (`wm1`) containing wind speed data (in m/s) at two sites: a reference site (`Rspd`) and a candidate site (`CSpd`). Data were collected every 6 hours for the year 2002 except in the month of May. The objective is to be able to determine whether wind speed at the reference site, which has permanent wind speed monitoring equipment, can be used to predict the wind speed at this candidate site in the future in order to determine whether to place a small wind farm there.

```{r}
# Load the workshop data here
load("Wind.RData")
nrow('Wind.RData')
```


a. Draw a scatterplot of the response `CSpd` against the predictor `RSpd` and label it appropriately. 

```{r PlotData, fig.height=7}
par(pty = "s") # A graphical parameter that sets the PlotTYpe to "square"

# The syntax of the plot statement says "plot CSpd against RSpd but extract them 
# from the data frame wm1. The xlim and ylim arguments are the same - the range 
# of all the data. 
plot(CSpd ~ RSpd, data = wm1, 
     xlab = "Wind speed at reference site (m/s)", 
     ylab = "Wind speed at candidate site (m/s)",
     main = "Wind speeds at reference and candidate sites",
     xlim = range(c(wm1$CSpd, wm1$RSpd)), ylim = c(range(c(wm1$CSpd, wm1$RSpd)))) 

nrow(wm1)
```

Fit a simple linear model to these data, and present the appropriate regression summaries. Plot the fitted line onto the plot in (a).

```{r}
# Notice the similarity with the plot statement above
Wind.lm <- lm(CSpd ~ RSpd, data = wm1)
# Here are the contents of the object Wind.lm:
names(Wind.lm)
# and here's some basic output:
Wind.lm

# A bit more comes from the summary() function:
summary(Wind.lm)
```

  *To plot the fitted line onto an existing plot in R or RStudio, you only need to invoke the `abline` command (from the console) that we've used already and that's shown below. Here, however, because I want to produce a separate plot from the one above, I have to re-plot the data using the above commands and then invoke `abline`.*
  
```{r fig.height=7}
par(pty = "s")
plot(CSpd ~ RSpd, data = wm1, 
     xlab = "Wind speed at reference site (m/s)", 
     ylab = "Wind speed at candidate site (m/s)",
     main = "Wind speeds at reference and candidate sites",
     xlim = range(c(CSpd, RSpd)), ylim = c(range(c(CSpd, RSpd)))) 
# abline() takes two arguments: a slope and an intercept. We can
# extract these from the object Wind.lm (which is a list) by 
# using the $ operator, e.g., Wind.lm$coefficients (try it!) and
# using that as an argument to abline():
abline(Wind.lm$coef) # or simply abline(Wind.lm)
```  


b. Construct the four diagnostic plots that we discussed in the lecture. Do you think the simple linear model is a good fit? What improvements might you suggest?

```{r}
par(mfrow=c(2,2))  # Places 4 graphs together on the same plotting region
plot(Wind.lm)
```

The interpretation:

         - Plot 1 - Residuals vs Fitted: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.

         - Plot 2 - Scale - Location: There is no clear pattern, eventhough large fitted values seem to have smaller residuals.

         - Plot 3 - QQ plot of standardised residuals: while most of the points are on the straight line, some points on the upper part are off the lines. Normality might not be satisfied.

         - Plot 4 - Residuals vs Leverage: As $n=1116$ then the cut-off for Cook's distance is $D_i > 4/(n-2)=0.0036.$ We can see that many of the standardised residuals with the leverage greater than 0.04. 
         

## Question 4: House selling price data

A set of data on the selling price $Y$ (in \$10,000) and annual taxes $X$ (in \$1,000) for 24
houses is available available online, {housesellingprice.txt}.


a. Assuming that a simple linear regression model is appropriate,
obtain the least squares fit relating selling price to taxes paid.
What is the estimate of $\sigma^2$?


Read data in

```{r size="scriptsize"}
house = read.table("housesellingprice.txt",header=T)
summary(house)
```

Simple Linear regression fit

```{r size="scriptsize"}
house.lm<-lm(sales~taxes,data=house)
summary(house.lm)
```

$\widehat\beta_0=13.3202$, $\widehat\beta_1=3.3244$

```{r size="scriptsize"}
anova(house.lm)
```

$\widehat\sigma^2={Residual MS}=8.77$


b. Find the mean selling price given that the taxes paid are $7.50$.


```{r size="scriptsize"}
predict(house.lm,newdata=data.frame(taxes=7.5),interval="confidence",level=0.95)
```

the mean selling price $=38.25296$ that the taxes paid are $X_0=7.50$.


c. Construct a graph of $Y$ versus $X$. Then add the fitted line,
the 95\% confidence interval of the mean response, and the 95\%
prediction interval of the new actual values on the graph.


```{r size="scriptsize"}
with(house,plot(sales~taxes,type="p"))
with(house,lines(fitted(house.lm)~taxes))
new = data.frame(taxes=seq(3,10,.1))
CIs = predict(house.lm,new,interval="confidence")
PIs = predict(house.lm,new,interval="predict")
matpoints(new$taxes,CIs,lty=c(1,2,2),col=c("black","red","red"),type="l")
matpoints(new$taxes,PIs,lty=c(1,2,2),col=c("black","blue","blue"),type="l")
```


d. Perform diagnostic checking for the fitted model in (a) using “plot(file.lm).” Interpret the outputs.

```{r}
par(mfrow=c(2,2))
plot(house.lm)
```


+ The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity. 
The smoothing curve shows a reasonable linear relationship and no pattern.

+ The bottom left provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is a curve which means that independence and constant variance may not be satisfied.

+ The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.

+ The bottom right plot of standardized residuals against leverage enables one
to readily identify any ‘bad’ leverage points. 

```{r}
# check help function first
house.lev<-hatvalues(house.lm)
house.lev
```
```{r}
std.res.h=rstandard(house.lm)
#std.res.h
sort(std.res.h)
```

        - Recall that the rule for simple linear regression for classifying a point as a leverage
        point is if leverage $h_{ii} > 4/ n.$ As $n=24$, the rule is $h_{ii} > 0.167.$ As leverage $h_{ii}$ in the  
        above output shows that one point (observation no 24) with $h_{ii}=0.17177 > 0.167$ , then there is one
        leverage point.

        - Recall that we classify points as outliers if their standardized residuals have
        absolute value greater than 2. There is one observation (no 15) with the standard residual is greater than  
        2. However observation number 24, the leverage point,  with standard residual less than 2 is a good
        leverage point.
        
        
## Question 5 Bid data (Sheather, 2009)

A building maintenance company is planning to submit a bid on a contract to clean corporate
offices scattered throughout an office complex. The costs incurred by the maintenance company are
proportional to the number of cleaning crews needed for this task. Recent data are available for the number
of rooms that were cleaned by varying numbers of crews. For a sample of 53 days, records were kept of the 
number of crews used and the number of rooms that were cleaned by those crews.
The data can be found in the file *cleaning.txt*.

The aim is to develop a regression equation to model the relationship between the 
number of rooms cleaned and the number of crews.

Complete the following tasks.

(a). Fit the least squares line to the data using R. Explain the fitted model, by interpreting the slope and intercept.

```{r}
bid=read.table("cleaning.txt", header=T)
head(bid)
```

```{r}
bid.lm=lm(Rooms ~ Crews, data=bid) ## the LS line
plot(Rooms ~ Crews, data=bid, xlab="Number of Crews", ylab="Number of Rooms")
abline(bid.lm)
summary(bid.lm)
```

The equation of the least squares line of best fit is

$$Rooms = 1.78 + 3.70 Crews.$$


The intercept is $1.78 \approx 2$ which is where the line of best fit crosses the Rooms axis. The
intercept in the model has the following interpretation: for Crews=0, the
average of number of cleaned room is 2.

The slope of the line is $3.70 \approx 4$. Thus, we say that for each
additional crew, it is predicted to add 4 rooms are being cleaned. 

(b). Construct residual diagnostic plots, and assess whether you think that the assumptions for linear regression have been satisfied. You can use the function `rstudent` (or `rstandard`) to first standardize the residuals.

First, we manually check the *normality through histograms and QQ-plots of residuals and standardised residuals.*

```{r}
res.bid=bid.lm$residuals
std.res.bid=rstandard(bid.lm)  ## standardised residuals
par(mfrow=c(2,2))  ## plotting 4 plots to check normality
hist(res.bid)
hist(std.res.bid)
qqnorm(res.bid)
qqline(res.bid)
qqnorm(std.res.bid)
qqline(std.res.bid)
par(mfrow=c(1,1))
```

From the above histograms and QQ-plots, the normality is satisfied.

```{r}
par(mfrow=c(2,2))
plot(res.bid, xlab="Time", ylab="Residuals") ## Residuals vs Time
plot(std.res.bid,xlab="Time", ylab="Standardised Residuals")
plot(bid$Crews,res.bid, xlab="Number of Crews", ylab="Residuals") # Residuals vs x
plot(bid$Crews,std.res.bid,  xlab="Number of Crews", ylab="Standardised Residuals")
```

     - The next residual analysis is to check about randomness (independence) and constant variance. We plot both
     residuals and standardised residuals against time (top left and top right). From these plots we can confirm the
     the residual are random scattered with no pattern whatsoever. However, there seems to show an increasing trend for variances.

     - The bottom left plot shows the residuals against Number of Crews to check randomness (no pattern) and constant variance. The bottom right plot provides the standardised residuals against fitted values to check randomness (no pattern) and constant variance. 
  
    - It is clear that  the variability in the standardized residuals tends to increase  with the number of crews. Thus, the assumption that the variance of the errors is constant appears to be violated in this case. 

(c). Perform diagnostic checking for the fitted model in (a) using "plot(file.lm)." Interpret the outputs.

```{r}
par(mfrow=c(2,2))
plot(bid.lm)
```

       -  The top left plot shows the residuals against fitted values to check randomness (no pattern) and linearity.  The smoothing curve shows a reasonable linear relationship and no pattern.

       - The bottom left provides the squared root of standardised residuals against fitted values to check randomness (no pattern) and constant variance. The smoothing curve using standardised residuals is an increasing trend which means that constant variance may not be satisfied.

        - The top right plot is a normal Q–Q plot. Most of the observations lie around the straight line except the two points, confirming Normality.

        -  The bottom right plot of standardized residuals against leverage enables one to readily identify any ‘bad’ leverage points (ie large leverage and large standardised residuals) 

(d). As the  data  on each axis are  in the form of counts and are measured in the same units, the square 
root transformation of both the predictor variable and the response variable should be implemented. Apply 
these transformations and repeat (a).


```{r}
bid.sq.lm=lm(sqrt(Rooms) ~ sqrt(Crews), data=bid) ## the LS line
plot(sqrt(Rooms) ~ sqrt(Crews), data=bid, xlab="Square Root of Number of Crews", ylab="Square Root of Number of Rooms")
abline(bid.sq.lm)
summary(bid.sq.lm)
```

(e). Perform diagnostic checking for the fitted model in (d) using "plot(file.lm)." Interpret the outputs.


```{r}
par(mfrow=c(2,2))
plot(bid.sq.lm)
```


         - After the transformation, the bottom left-hand plot further demonstrates the 
         benefit of the square root transformation in terms of stabilizing the error term.  
         
         - Thus, taking the square root of both the  x  and the  y  variables has stabilized
         the variance of the random errors and hence produced a valid model.  


(f). Predict the number of rooms that can be cleaned by 4 crews and by 16 crews and its 95% confidence intervals using the models in (a) and (d) respectively.

Using (a)

```{r}
conf.pred0=predict(bid.lm,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
conf.pred0
```

Using (d)

```{r}
conf.pred0.sq=predict(bid.sq.lm,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
conf.pred0.sq^2  ## as this was square root
```

       - We can see that the prediction interval based on the transformed data is narrower 
       than that  based on the untransformed data when the number of crews is 4 (7.78, 
       27.21) vs (1.58, 31.59) and wider when the number of crews is 16 [(43.33, 81.55) is 
       wider than (45.81, 76.19)]. 

       - This is expected as the original scale the data have variance which increases as 
       the  x -variable increases which means that realistic prediction intervals will
       get wider as the  x -variable increases.  

      - In summary, ignoring nonconstant variance in the raw data from this example led to 
      invalid prediction intervals. 
      
      